Abstract

We introduce the concept of self-healing in the field of complex networks modelling; in particular, self-healing capabilities 
are implemented through distributed communication protocols that exploit redundant links to recover the connectivity of 
the system. We then analyze the effect of the level of redundancy on the resilience to multiple failures; in particular, we
measure the fraction of nodes still served for increasing levels of network damage. Finally, we study the effects of
redundancy under different connectivity patterns (from planar grids, to small-world, up to scale-free networks) on healing
performance. Small-world topologies show that introducing some long-range connections in planar grids greatly enhances
the resilience to multiple failures, with performance comparable to that of the most resilient (and least realistic)
scale-free structures. Obvious applications of self-healing are in the important field of infrastructural networks such as
gas, power, water and oil distribution systems.
Introduction

In the field of complex networks [1,2], most studies have
focused on how to improve the robustness (i.e. the capability of
surviving intentional and/or random failures) of existing
networks [3]. Much less has been done regarding resilience (i.e. the
capability of recovering from failures). In fact, implementing smart (as
well as economical) strategies aimed at maintaining a high level of
performance is a crucial issue yet to be solved, and represents one
of the most pressing and interesting scientific challenges. A most
important field of application for the results of such investigations
is infrastructural networks. Infrastructural networks are the
backbone of our society, which critically depends on the continuity of
functioning of systems like power, gas or water distribution.

As a standard, infrastructural networks have been designed to 
be resilient at least to the loss of a single component; on the other 
hand, their constantly growing size has increased the possibility
of multiple failures, which often were not considered in their
original design. In general, implementing the possibility of 
recovering from any sequence of k failures requires an exponentially 
growing effort in means and investments; it is therefore viable to 
consider implementing systems that are able to recover from k 
failures on average: in this paper we will follow such a statistical 
approach. 

In the field of communication [4-6] and wireless networks
[7-10], self-healing algorithms have recently been the subject of
massive investigation. In general, such strategies aimed at
maintaining network connectivity assume the possibility of
creating new communication channels among the nodes of the
network, often with no constraints on the number of new
connections available [11]. This is not the case in infrastructural
networks, where the possibility of creating new links among nodes is
normally not available (at least in the short run), since links are
physical (fixed in advance) and creating new ones requires both
time and investment.

In general, self-healing in infrastructural networks should be
thought of as a constrained mechanism in which only a limited
amount of resources is available. An example of such an approach
can be found in materials science, where new polymeric compounds
are capable of self-healing due to the presence of small amounts of
healing agents that get released and activated upon cracking
[12,13]. An alternative strategy to ensure the continuity of a
system is to ensure redundancy in the interconnectivity of its
components; for example, when a hole is punched in a leaf, the
remaining vessels are capable of sustaining the extra flow necessary to
keep the tissues alive [14].

Infrastructural networks are very well engineered systems 
characterised by fluxes of commodities (from electric power to 
drinking water). In this paper we consider a simplified description 
of such systems in terms of complex networks with a simple 
dynamical process describing the flow of a commodity from one or 
more sources (production) to several sinks (consumption). We then
introduce a novel healing strategy based on the activation of fixed
redundant resources (backup links) via a generic routing algorithm 
and study the resilience of the networks to multiple failures. The 
presence of such backup links is customary in technological 
networks; hence, our self-healing procedure is within the reach of 
current technology. As an example, urban low-voltage distribution 
power grids have an almost planar topology and are essentially 
radial (tree-like) networks with few inactive backup-links that can 
be activated (often manually) to restore power in case of failures. 
Results

Model 

In our scenario the system is assumed to describe a network that
distributes some utility following flow conservation analogous to
Kirchhoff's current law; examples of systems following such
constraints are not only power grids, but also the flows of fluids
in distribution networks like water, gas, oil (at least at stationarity).
As a further simplification, we will consider a single node to be the
source of the commodity distributed on the network and we will
not consider any constraint on the amount of flow that can be
transported by any link; hence, connectivity between a sink and the
source is enough to have the sink served. In our scenario, all the
nodes (except the source) are considered to be potential sinks.
Hence, to serve as many nodes as possible, connectivity must be
maximized.

A further assumption is that at each instant of time, the topology
of the network distributing the commodity is a tree (the active tree);
in fact, such a structure meets the needs of infrastructure managers,
i.e., to measure (for billing purposes) in an easy and precise way
how much of a given quantity is served to any single node of the
network. In networks (like drinking water) where such an
assumption is not strictly true, very few loops (i.e. low meshedness)
are present [15].

To model active trees, we start from an underlying network
topology and build up a spanning tree. The set of all possible
redundant links is exactly the set of links in the network not
belonging to the spanning tree. In order to allow for recovery, we
also consider the presence of dormant backup links (i.e., a set of
links that can be switched on), as in the case of urban low-voltage
distribution power grids.
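The construction just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the helper names `grid_links` and `traversal_spanning_tree` are hypothetical, the 4 x 4 grid and the redundancy 4/9 are borrowed from the example of Figure 1, and the spanning tree is built by a simple randomized traversal rather than by the uniform sampling used in the paper.

```python
import random

def grid_links(n):
    """Links of an n x n square grid; node (row, col) gets the id row * n + col."""
    links = set()
    for row in range(n):
        for col in range(n):
            v = row * n + col
            if col + 1 < n:
                links.add(frozenset({v, v + 1}))
            if row + 1 < n:
                links.add(frozenset({v, v + n}))
    return links

def traversal_spanning_tree(links, root, rng):
    """Spanning tree from a randomized traversal (assumption: NOT uniformly sampled)."""
    adj = {}
    for link in sorted(links, key=sorted):
        u, v = sorted(link)
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen, frontier, tree = {root}, [root], set()
    while frontier:
        u = frontier.pop(rng.randrange(len(frontier)))
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                tree.add(frozenset({u, v}))
                frontier.append(v)
    return tree

rng = random.Random(0)
links = grid_links(4)                                    # underlying topology
active = traversal_spanning_tree(links, root=0, rng=rng) # active tree
backup = links - active                                  # all possible redundant links
r = 4 / 9                                                # fraction kept as dormant links
dormant = set(rng.sample(sorted(backup, key=sorted), round(r * len(backup))))
```

For the 4 x 4 grid this yields 24 links in the underlying graph, 15 in the active tree, 9 possible backup links and 4 dormant ones.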

While commodities can be transported only via active links, in 
order to implement our self-healing strategy we assume that nodes 
are able to communicate by means of a suitable distributed 
interaction protocol only with the set of neighbouring nodes, i.e. 
the ones connected either via active or via dormant links. 
According to our procedure, when either a node or a link failure 
occurs, all the nodes that cannot be served - i.e. there is no path to 
the source - disconnect from the active tree. Afterwards, unserved 
nodes try to reconnect to the active tree by waking up (activating) 
through the protocol some of their dormant backup links. Such a 
process reconstructs a new active-tree that can restore totally or 
partially the connectivity, i.e. heals the system. A more formal
description of the self-healing procedure and of the simulation
protocols is provided in the Methods section.

A natural metric to quantify the success of such a procedure is 
the fraction of served nodes (FoS). In order to identify the system's 
properties that are able to maximize the FoS we study the effects 
of varying the fraction of backup links (redundancy) according to 
different underlying connectivity patterns with respect to multiple 
random failures. 

In order to stress the peculiarities of different network structures,
we generate classes of graphs with different connectivity patterns (see
Methods). We start our investigation by focusing on the underlying
topology which most often resembles the actual situation of
infrastructural networks, i.e. nodes placed on a planar square grid
(SQ). Then, we stress the role of the underlying networks'
connectivity patterns by using the scale-free (SF) topology
generated according to Barabási-Albert [16] and the small-world
(SW) topology generated according to Watts and Strogatz [17]. All
the initial network structures are generated by using the IGRAPH
library [18].

To generate the random spanning trees associated with each kind
of network structure, we use the flat sampling algorithm of Wilson
[19]. We take such spanning trees as the initial configuration of
our model distribution networks. The links not belonging to the
spanning trees form the set of the possible backup links of our
system; among such links, we choose a random fraction r of dormant
links that can be used to heal the system. We then simulate the
occurrence of uncorrelated multiple failures by deleting at random
k links of the initial active tree. Notice that link failures are the
most general ones, as a node failure is equivalent to the
simultaneous failure of all its links.

The source node (i.e., the root of the oriented active tree) is
chosen at random among all the nodes of the underlying network.
The only exception is the case of the SF networks where we use, 
according to the preferential attachment principle, the natural 
choice of having the node with the highest number of neighbours 
(the central hub) as the source. 

Our self-healing algorithm is a routing protocol (see Methods)
whose goal is to reconstruct the maximal tree
connected to the source after a failure has occurred; in doing
so, we use both the surviving links of T and the dormant links D;
fig. 1 illustrates such a procedure. After the recovery, we calculate
FoS, the fraction of nodes connected to the source.

Effects of networks' topology 

In order to test the performance of our healing algorithm against
failures in terms of the service provided after the active tree
restoration, we simulate the model for an increasing number of
failures. Recalling that each failure causes a cascade (i.e., each
node of the sub-tree served by the broken link becomes unserved), we
investigate the role of the redundancy r on different topologies.
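The cascade caused by a single link failure can be made concrete with a short sketch: the served nodes are those still reachable from the source over the surviving active links, and everything else is unserved. The function name `served_nodes` and the six-node toy tree are assumptions chosen purely for illustration.

```python
from collections import deque

def served_nodes(source, active_links):
    """Breadth-first search from the source over undirected active links."""
    adj = {}
    for u, v in active_links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

# Toy active tree rooted at node 0 (hypothetical example).
tree = {(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)}
failed = {(1, 3)}                       # the broken link

served = served_nodes(0, tree - failed)
unserved = {0, 1, 2, 3, 4, 5} - served  # the whole sub-tree below the broken link
```

Here the failure of the single link (1, 3) leaves nodes 3, 4 and 5 unserved, i.e. the entire sub-tree that was fed through that link.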

We start our study by addressing planar square grid (SQ)
networks, since they are the most similar to real physical
networked infrastructures. In the first scenario, we generate
spanning trees on square grids; fig. 2(a) shows the variation of the
restored FoS with respect to the number of failures k for different
redundancies r. For square grids, we do not observe any relevant effect
of the redundancy on the FoS; this means that a very small
fraction of backup links (r = 0.1, i.e. 10%) already suffices to attain the
maximum resilience.

The situation is completely different when the underlying
topology is a scale-free network generated through the
Barabási-Albert model [16]. A widespread property of real networks is
that the degree distribution follows a scale-free power
law [1,20,21]. This feature has been found to be a
consequence of the so-called preferential attachment, i.e. networks
expand continuously through the addition of new vertices which
attach preferentially to already well connected nodes. Although
technological networks do not show power-law degree distributions
due to economic and spatial constraints [22], we choose to
investigate SF networks for their marked robustness against random
failures [23]. For SF networks, it is natural to choose the node
with the highest degree (the hub) as the source. The quality of
service restored by our self-healing algorithm on SF networks is
shown in fig. 2(b). As expected, we find that SF networks can
easily recover connectivity to all the nodes even for low
redundancies. Such error tolerance comes at the high price of being
extremely vulnerable to targeted node attacks: isolating the hub
disconnects the whole system. High error tolerance and vulnerability
to targeted attacks are indeed generic properties of SF networks
[24].

Figure 1. Example of healing after a single link failure. Panels: (a) initial configuration, (b) link failure, (c) recovered network. Notice that the failure of a single node can be modelled as the failure of all its links; hence,
multiple link failures are the more general event to be considered. (Left panel) In the initial state, the source node (filled square, upper left corner) is
able to serve all 16 nodes through the links of the active tree. The 4 dashed lines (green online) represent dormant backup links that can be activated
upon failure. The redundancy of the system is r = 4/9, as only 4 of the 9 possible backup links are present. The link marked with an X is the one that is
going to fail. (Central panel) A single link failure disconnects all the nodes of a sub-tree; in the example, a sub-tree of 6 nodes (red online) is left
isolated from the source, i.e., the system has a damage A = 6. (Right panel) By activating a single dormant backup link, the self-healing protocol has
been able to recover connectivity for the whole system, in this case bringing the number of served nodes back to its maximum value of 16. The link that
has recovered the connectivity is marked with an R.
doi:10.1371/journal.pone.0087986.g001

We then consider the case of small-world (SW) networks
generated according to the Watts-Strogatz rewiring procedure [17].
In the case of technological networks, small-world networks are
important since they highlight the effects of introducing long-range
links in a planar topology. Starting from an initial graph (planar
square grids in our case), we rewire each link with probability p to
a randomly selected node; in this way we can interpolate from the
case of SQ networks (p = 0) to the case of a random graph (p = 1).
As in the case of simple percolation [25], the rewiring procedure
introduces some long-range links (i.e., links between distant nodes on the
square grid) that improve the robustness to random failures.
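A minimal sketch of such a rewiring on a square grid is given here for illustration. It is not the IGRAPH routine used for the simulations: the helpers `grid_links` and `rewire` are hypothetical, and the choice of skipping self-loops and duplicate links (so that the number of links is preserved) is an assumption. Connectivity of the rewired graph is not enforced.

```python
import random

def grid_links(n):
    """Links of an n x n square grid; node (row, col) gets the id row * n + col."""
    links = set()
    for row in range(n):
        for col in range(n):
            v = row * n + col
            if col + 1 < n:
                links.add(frozenset({v, v + 1}))
            if row + 1 < n:
                links.add(frozenset({v, v + n}))
    return links

def rewire(links, num_nodes, p, rng):
    """With probability p, move one endpoint of each link to a random node."""
    result = set(links)
    for link in sorted(links, key=sorted):
        if rng.random() < p:
            keep = min(link)                      # endpoint that stays in place
            target = rng.randrange(num_nodes)     # candidate new endpoint
            candidate = frozenset({keep, target})
            # Skip self-loops and duplicates so the number of links is preserved.
            if target != keep and candidate not in result:
                result.remove(link)
                result.add(candidate)
    return result

rng = random.Random(1)
square = grid_links(10)                       # p = 0 limit: the SQ network
small_world = rewire(square, 100, 0.2, rng)   # some links become long-range
unchanged = rewire(square, 100, 0.0, rng)     # no link is ever rewired
```

With p = 0 the grid is returned unchanged, while larger p replaces a growing share of nearest-neighbour links with links between arbitrary pairs of nodes.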

In order to understand the role of the connectivity pattern, we
study our model on SW networks with different rewiring
probabilities. In fig. 2(c) we show the performance of our
self-healing strategy with respect to an increasing number of failures.
We see that a higher rewiring probability increases the number of
served nodes after the restoration through the backup network;
this effect shows up even though the clustering among
neighbouring nodes (normally associated with local robustness
against failures) decreases; therefore, long-range links increase the
possibility of the network staying connected even after multiple
failures.

Finally, we compare in fig. 3 the effectiveness of the self-healing
protocol across different network structures. Notice that while distribution
grids based on the SF topology are the most robust, they should
be disregarded when considering the case of technological
networks, since economic and geometric constraints make SF
networks unfeasible on planar topologies.

Discussion 

In this paper we have introduced a minimal procedure of
self-healing in networks. Such a procedure exploits the presence of
redundant edges to recover the connectivity of the system. Our 
scenario is inspired by real-world distribution networks that are - 
often for economic reasons - almost tree-like and at the same time 
are provided with alternative backup links that can be activated in 
case of malfunctioning. An example of such networks is the case of 
urban low-voltage distribution networks [26]. 

Our strategy could be readily implemented with
current technologies. In fact, routing protocols represent a vast
available source of distributed algorithms able to maintain the
connectivity of a system; hence, our scheme could be implemented
by the standard procedure of coupling an ICT network to a
pre-existing infrastructure. Our strategy is an example in which
interdependencies between two networks enhance the resilience
instead of introducing catastrophic breakdowns [27].

By studying the performance of our procedure as a function of
the redundancy on different underlying network topologies, we
have shown that distribution networks akin to real-world ones (i.e.,
based on planar lattices) are the least resilient to random failures.
In fact, the most robust networks are, as expected, based on the
SF topology; however, such a topology is unrealistic for
technological networks. Our results on SW topologies hint that
a very effective strategy to strengthen realistic networks is to add
long-range links. The feasibility of such a strategy would depend
on a cost-benefit analysis of the implementation of these
physical long-range links. A further direction of study would be to
consider the effects of more detailed structural characteristics of
the underlying network topologies [28] or even to consider
biologically inspired designs, like dynamic networks inspired by the
human brain [29].

While our minimal model considers only the connectivity of the
system, it can be easily extended to take into account the magnitude
of the flows: in fact, routing algorithms can account for the
capacity of the links and dynamically re-route flows.

Our model also easily allows for cold starts, i.e., situations in
which the network has shut down due to some major event (like a
black-out) [30]. This is an important issue, as one of the most time
(and money) consuming activities after a major event is restoring
the functionality of the network.

In this paper, we have considered only the single-source case.
The next step is to consider a network served by multiple sources. In
fact, the possibility of separating the system into trees would solve the
"who is serving whom" problem that appears as soon as several competitors
share the same physical line in bringing power to their customers
[31]. Moreover, the possibility for the system of dynamically
separating into time-varying trees would allow for introducing a
commodity market based on real-time economic competition
among the owners of the sources. This further goal is not yet
within the reach of current routing protocols and should be further
investigated if we want to have grids that are smart not only in
their ability to self-repair but also in optimizing consumption and
prices. Finally, we believe that studying and designing self-healing
mechanisms in complex networks is a promising field of
investigation where the dynamics of the systems should also be
taken into account [32,33].

Figure 2. Self-healing results for networks of size $10^4$. Panel (a): distribution networks based on square grids. The average fraction $\langle FoS \rangle$ of
nodes that the self-healing protocol is able to restore decreases with the number of faults k, with no relevant dependence on the redundancy; results
are shown for a network of $10^4$ nodes. Panel (b): distribution networks based on scale-free networks generated according to Barabási-Albert [16]. The
average fraction $\langle FoS \rangle$ of served nodes is plotted against the number of failures k. Even for a low 10% redundancy (r = 0.1), the system can
almost totally heal after sustaining $k \sim 4 \times 10^2$ failures; as a comparison, for the same number of failures square grids lose ~90% of the nodes. Panel
(c): distribution grids based on small-world networks obtained by rewiring a fraction p = 0.2 of links according to Watts-Strogatz [17]. The average
fraction $\langle FoS \rangle$ of served nodes is plotted against the number of failures k. In contrast to square grids and scale-free networks, the
restored fraction of service FoS shows a marked dependence on the redundancy parameter r. Similar results are obtained for p = 0.1 and p = 0.3.
doi:10.1371/journal.pone.0087986.g002

Materials and Methods 

The Self-Healing procedure 

We consider an abstract model of a physical networked
infrastructure described by the quadruple $N = (V, v_s, E_A, E_D)$.
Here $V$ is the set of nodes of the network, $v_s \in V$ is the source node, $E_A$ is
the set of active links among the nodes and $E_D$ denotes the set of
dormant links that can be activated in order to heal system failures
by re-connecting nodes. A node is considered to be served if it is
connected to the source through a path of active links; all the nodes
in $V$ are initially connected to the source via a spanning tree. As
the basic metric for any quality of service assessment, we consider
the fraction of served nodes FoS, obtained by counting the number of nodes in
the active graph, i.e. connected to $v_s$.



More formally, in the initial configuration, the graph
$T = (V, E_A)$ is an instance of the set $R_T(G)$ of all the random
spanning trees of the underlying graph $G = (V, E)$. Thus, $T$ before
the failures has $|V|$ (active) nodes and $|E_A| = |V| - 1$ links among
them. The set $E_D$ of backup edges is taken from the remaining
edges of the underlying graph $G$, i.e. $E_A \cup E_D \subseteq E$ and
$E_A \cap E_D = \emptyset$. The fraction $r = |E_D| / [|E| - (|V| - 1)]$ measures
the redundancy of $N$.
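As a worked instance of this definition, the numbers quoted in the caption of Figure 1 can be checked directly. The only assumption is that the underlying graph there is a 4 x 4 square grid, which is inferred from the 16 nodes and the 9 possible backup links rather than stated explicitly:

```latex
\begin{align*}
|V| &= 16, \qquad |E| = 2 \cdot 4 \cdot (4 - 1) = 24, \qquad |E_A| = |V| - 1 = 15, \\
|E| - (|V| - 1) &= 24 - 15 = 9 \quad \text{(possible backup links)}, \\
r &= \frac{|E_D|}{|E| - (|V| - 1)} = \frac{4}{9} \approx 0.44 .
\end{align*}
```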

We then consider the occurrence of multiple link failures. A
$k$-failure is a subset $E_F \subset E_A$ of $k$ links chosen at random. The
system right after a failure is described by the forest
$T_{fail} = (V, E_A - E_F)$ and by the set $E_D$ of dormant links available
for the healing. A healing protocol is any algorithm that, by
activating (waking up) a subset $E_W \subseteq E_D$ of dormant edges, finds a
maximal tree $T'$ of $G' = (V, E_D \cup (E_A - E_F))$ containing the
source $v_s$. If $T'$ is spanning, then the system has fully recovered.

Figure 3. Comparison among different network structures. Here
we show the performance of our self-healing algorithm with respect to
the quality of service for an increasing number of removed links with the
redundancy r fixed to 0.3; for SW networks, the rewiring probability is
p = 0.2. The average fraction $\langle FoS \rangle$ of served nodes is plotted
against the number of failures k.
doi:10.1371/journal.pone.0087986.g003

For the robustness of the algorithm, we will assume that nodes
have only a local knowledge of the network, i.e., only about the
state (active, dormant or failed) of their incoming links. To build
the maximal connected tree, nodes communicate with their
neighbors via a suitable distributed protocol allowing unserved nodes
to join the active network by activating dormant edges. In other
words, nodes are endowed only with the minimal requirements of
routing needed to reconstruct a spanning tree [34]. In this paper,
we have applied the following simple distributed algorithm to
implement self-healing:

$N = (V, v_s, E_A, E_D) \leftarrow$ initial configuration
$E_F \leftarrow$ failed links
$U \leftarrow$ unserved nodes
$V' \leftarrow V - U$
$E'_A \leftarrow (E_A - E_F) \cap (V' \times V')$
$E'_D \leftarrow (E_D - E_F) \cup (E_A - V' \times V')$
$E_W = \emptyset$
repeat
    for all $v \in U$ do
        choose a random neighbor $a(v)$ connected to $V'$ through any edge of $E'_D$
    for all $v \in U$ do
        if $a(v) \neq \emptyset$ then
            $V' = V' + \{v\}$
            $U = U - \{v\}$
            $E'_A = E'_A + \{(a(v), v)\}$
            $E'_D = E'_D - \{(a(v), v)\}$
            $E_W = E_W + \{(a(v), v)\}$
until $(V' \times U) \cap E'_D = \emptyset$
$E'_D \leftarrow E'_D \cap (V' \times V')$
return $N' = (V', v_s, E'_A, E'_D)$
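The listing above can be turned into a small runnable sketch. This is one possible reading of the pseudocode rather than the code used for the simulations: the names `reachable` and `heal` are hypothetical, links are stored as frozensets, failed links are explicitly excluded from the links usable for healing (an assumption), and the surviving links of cut-off sub-trees are treated as re-activatable, as suggested by the definition of $E'_D$.

```python
import random
from collections import deque

def reachable(source, links):
    """Nodes connected to `source` through the given undirected links."""
    adj = {}
    for u, v in map(tuple, links):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {source}, deque([source])
    while queue:
        for w in adj.get(queue.popleft(), ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def heal(nodes, source, active, dormant, failed, rng):
    surviving = active - failed
    served = reachable(source, surviving)               # V'
    unserved = set(nodes) - served                      # U
    new_active = {e for e in surviving if e <= served}  # E'_A
    # Assumption: dormant links plus surviving links of the cut-off sub-trees.
    usable = (dormant - failed) | (surviving - new_active)
    activated = set()                                   # links woken up while healing
    while True:
        choice = {}
        for v in sorted(unserved):
            # Served neighbours of v reachable through a usable link.
            options = sorted(next(iter(e - {v})) for e in usable
                             if v in e and (e - {v}) <= served)
            if options:
                choice[v] = rng.choice(options)
        if not choice:          # no unserved node can reach the served set
            break
        for v, a in choice.items():
            e = frozenset({a, v})
            served.add(v)
            unserved.discard(v)
            new_active.add(e)
            usable.discard(e)
            activated.add(e)
    return served, new_active, activated

L = lambda u, v: frozenset({u, v})
nodes = range(6)
active = {L(0, 1), L(1, 2), L(1, 3), L(3, 4), L(3, 5)}
dormant = {L(2, 4)}
failed = {L(1, 3)}
served, healed_tree, activated = heal(nodes, 0, active, dormant, failed, random.Random(0))
fos = len(served) / len(nodes)
```

In this toy case node 4 rejoins through the dormant link (2, 4), after which nodes 3 and 5 rejoin through the surviving links of their sub-tree, so that all six nodes are served again. Since every joining node attaches through exactly one link to an already served node, the healed structure remains a tree.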

By definition, the nodes in $N'$ are the set of served nodes $V'$.
Notice that the state $(V', v_s, E_A + E_W - E_F, E_D - E_W)$ still
describes a network infrastructure; therefore, we can in general
describe the state of the system at time $t$ by the quadruple
$(V(t), v_s, E_A(t), E_D(t))$ and the sequence of failures between
time $t$ and $t+1$ by $E_F(t)$. A general representation of such a
process can be given in terms of time-varying graphs [35].

Figure 4. Different network topologies. Upper panels, from left to right: (a) planar square grid (SQ), (b) small-world network (SW) generated
according to Watts-Strogatz [17] and (c) scale-free network (SF) generated according to Barabási-Albert [16]. Lower panels (d)-(f): random spanning trees
associated with the related underlying topologies in the upper panels.
doi:10.1371/journal.pone.0087986.g004

Simulations 

We analyse the response of the system to $k$-failures. To study the
effects of different topologies, we perform simulations on different
grids (fig. 4, upper panels).

Classically, engineered systems (especially in the power industry)
are built to be $N-1$ robust, i.e. they can survive the failure of
any single one of their components. While checking $N-1$ robustness
corresponds to checking all the $N$ possible single failures, checking
$N-k$ robustness requires considering $N!/(N-k)! \sim N^k$ cases.
Therefore, checking $N-k$ robustness is infeasible even for
modest values of $k$ due to the combinatorial explosion of the
number of possible cases. Thus, we choose to assess on
probabilistic grounds whether a system would be able to sustain
$k$ failures by a Monte Carlo investigation of the space of possible
failures.
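To give a feeling for this combinatorial growth, the following short sketch evaluates $N!/(N-k)!$ for a system of $10^4$ components and compares it with $N^k$. The function name `ordered_failure_cases` is hypothetical; `math.perm` (available from Python 3.8) computes exactly this falling factorial.

```python
import math

def ordered_failure_cases(n, k):
    """Number of ordered sequences of k distinct failures among n components."""
    return math.perm(n, k)   # n! / (n - k)!

n = 10**4
for k in (1, 2, 3):
    # Columns: k, exact number of cases, the N^k estimate.
    print(k, ordered_failure_cases(n, k), n**k)
```

Already for k = 3 the number of cases is of the order of $10^{12}$; counting unordered sets of failures instead only divides this figure by k!.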

Service operators are interested in maintaining their service
level agreements (contracts) with their customers; to such an aim,
customers must in the first place remain connected to the services.
Therefore, we calculate the average fraction of served customers
FoS after the occurrence and the healing of k random failures. To
do so, we choose at random k different links on the service tree
and delete them; after that, we apply the self-healing procedure;
finally, we calculate the FoS as the fraction of nodes connected to
the source. We average such a procedure over several network
realizations until the relative error of the average FoS is small
enough (less than 5%). As an example, for a grid of $10^4$ nodes, we
must typically average over 100 sets of random failures to attain
the desired accuracy. Moreover, to average out the different
characteristics of the initial configurations, we repeat the
procedure over 100 different independently generated initial
configurations.
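The averaging procedure can be sketched as follows, under several simplifying assumptions that are not taken from the paper: a small square grid, a spanning tree obtained by a randomized traversal instead of uniform sampling, a fixed number of trials instead of the 5% stopping rule, and hypothetical helper names. Since a healing protocol finds a maximal tree of $G'$ containing the source, the healed FoS is computed here simply as the size of the connected component of the source in the graph made of surviving active links plus dormant links.

```python
import random
from collections import deque

def grid_adjacency(n):
    """Adjacency lists of an n x n square grid with (row, col) nodes."""
    adj = {(r, c): [] for r in range(n) for c in range(n)}
    for (r, c) in list(adj):
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in adj:
                adj[(r, c)].append(nb)
                adj[nb].append((r, c))
    return adj

def traversal_tree(adj, root, rng):
    """Spanning tree from a randomized traversal (assumption: not uniform)."""
    seen, frontier, tree = {root}, [root], set()
    while frontier:
        u = frontier.pop(rng.randrange(len(frontier)))
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                tree.add(frozenset({u, v}))
                frontier.append(v)
    return tree

def component_size(root, links):
    """Number of nodes connected to `root` through the given links."""
    adj = {}
    for u, v in map(tuple, links):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen, queue = {root}, deque([root])
    while queue:
        for w in adj.get(queue.popleft(), ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen)

def mean_fos(n, r, k, trials, seed=0):
    """Average healed FoS on an n x n grid with redundancy r after k link failures."""
    rng = random.Random(seed)
    adj = grid_adjacency(n)
    all_links = {frozenset({u, v}) for u in adj for v in adj[u]}
    total = 0.0
    for _ in range(trials):
        root = rng.choice(sorted(adj))                       # random source
        tree = traversal_tree(adj, root, rng)                # initial active tree
        backup = sorted(all_links - tree, key=sorted)        # possible backup links
        dormant = set(rng.sample(backup, round(r * len(backup))))
        failed = set(rng.sample(sorted(tree, key=sorted), k))
        total += component_size(root, (tree - failed) | dormant) / len(adj)
    return total / trials

estimate = mean_fos(n=10, r=0.3, k=5, trials=50)
```

A fuller version would keep adding trials until the relative error of the running average drops below the chosen threshold, and would repeat the whole procedure over independently generated initial configurations.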

To generate a random spanning tree $T$ associated with a graph $G$
(fig. 4, lower panels), we apply the exact algorithm of Wilson [19]
that samples uniformly the elements of $R_T(G)$. Such spanning
trees are taken as the initial configurations for our model
distribution networks. The links of the graph $G$ that do not
belong to the initial configuration $T$ form the set $E_B$ of the possible
backup links of our system; of such links, only a subset $E_D$ (the
dormant links) can be used to heal the system. The fraction
$r = |E_D| / |E_B|$ of such dormant links characterizes the redundancy
of the system: for $r = 0$ there are no links in $E_D$ and any failure
splits the tree, while for $r = 1$ any of the links of $G$ can be used to
recompose the system.
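For reference, Wilson's algorithm builds the tree by loop-erased random walks: starting from each node not yet in the tree, a random walk is run until it hits the tree, and only the last exit taken from every visited node is kept. The following is a compact sketch of that idea, not the implementation used for the simulations; the helper names and the 4 x 4 grid are assumptions made for illustration.

```python
import random

def grid_adjacency(n):
    """Adjacency lists of an n x n square grid; node (row, col) gets id row * n + col."""
    adj = {v: [] for v in range(n * n)}
    for row in range(n):
        for col in range(n):
            v = row * n + col
            if col + 1 < n:
                adj[v].append(v + 1)
                adj[v + 1].append(v)
            if row + 1 < n:
                adj[v].append(v + n)
                adj[v + n].append(v)
    return adj

def wilson_spanning_tree(adj, root, rng):
    """Uniform random spanning tree of a connected graph, as a child -> parent map."""
    in_tree = {root}
    parent = {}
    for start in adj:
        # Random walk until the current tree is hit; overwriting step[u]
        # on every revisit keeps only the last exit, which erases loops.
        step = {}
        u = start
        while u not in in_tree:
            step[u] = rng.choice(adj[u])
            u = step[u]
        # Retrace the loop-erased path and attach it to the tree.
        u = start
        while u not in in_tree:
            in_tree.add(u)
            parent[u] = step[u]
            u = step[u]
    return parent

rng = random.Random(42)
adj = grid_adjacency(4)
parent = wilson_spanning_tree(adj, root=0, rng=rng)
tree = {frozenset({v, p}) for v, p in parent.items()}   # links of T
```

The links of the grid that do not appear in `tree` then play the role of the set $E_B$ of possible backup links.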

Notice that in our case it would be more correct to speak about
$N-k$ resilience, since we do not consider whether the system is robust
to $k$ failures (i.e. whether it is still functioning after $k$ failures), but whether it
can recover from $k$ failures.